Quantitative structure-activity relationship (QSAR) models and their applicability domain analysis on HIV-1 protease inhibitors by machine learning methods

Tian, Y.J.; Zhang, S.D.; Yin, H.Y.; Yan, A.X.*
Chemometrics and Intelligent Laboratory Systems, 2020, 196, 103888

    HIV-1 protease inhibitors (PIs) make a vital contribution to highly active antiretroviral therapy (HAART) for human immunodeficiency virus (HIV) infection. In this study, 14 quantitative structure-activity relationship (QSAR) models were built on 1238 PIs using four machine learning methods: multiple linear regression (MLR), support vector machine (SVM), random forest (RF) and deep neural networks (DNN). The best model, Model 2G, built with the DNN algorithm, achieved a coefficient of determination (R2) of 0.88 and 0.79 and a root mean squared error (RMSE) of 0.39 and 0.51 on the training set and test set, respectively. For Model 2G, an applicability domain threshold (ADT) of 1.765 was derived from the training set. A compound whose similarity distance (d) is less than the ADT is considered to be inside the applicability domain, where the model's prediction is regarded as reliable; by this criterion, 65.37% of the compounds in the test set were reliably predicted. In addition, the 1238 PIs were manually divided into eight subsets with different scaffolds. Hydroxylamine derivatives and seven-membered cyclic urea derivatives showed higher inhibitory activity than the other subsets. We also built QSAR models with the SVM, RF and DNN methods on two subsets: 299 hydroxylamine derivative inhibitors (Dataset 2) and 377 seven-membered cyclic urea derivative inhibitors (Dataset 3). The best model on Dataset 2, Model 3A, gave an R2 of 0.71 and an RMSE of 0.53 on the test set. The best model on Dataset 3, Model 4B, gave an R2 of 0.82 and an RMSE of 0.51 on the test set. Finally, we analyzed the descriptors that contribute significantly to the bioactivity of the inhibitors in these two subsets. Highly active seven-membered cyclic urea derivatives usually contained several aromatic nitrogen heterocyclic substituents, such as imidazole and pyrazole, whereas the oxazolidinone group and sulfanilamide appeared mainly in highly active hydroxylamine derivatives. These observations may be used in designing promising HIV-1 protease inhibitors.
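
The applicability domain rule described in the abstract (a compound is inside the domain when its similarity distance d is less than the ADT) can be sketched as follows. The abstract gives only the threshold value and the rule, not how d or the ADT is computed. This sketch therefore assumes, purely for illustration, that d is the mean Euclidean distance from a compound to its k nearest training compounds in descriptor space, and that the ADT is the mean plus z standard deviations of those distances over the training set. The function names, the values of k and z, and the toy descriptor vectors are all hypothetical and are not taken from the paper.

```python
import math
import statistics


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_distance(query, training, k=3, skip_self=False):
    """Mean distance from `query` to its k nearest training compounds (assumed definition of d)."""
    dists = sorted(euclidean(query, t) for t in training)
    if skip_self:
        dists = dists[1:]  # drop the zero distance of a training compound to itself
    return statistics.mean(dists[:k])


def applicability_threshold(training, k=3, z=0.5):
    """Hypothetical ADT: mean + z * standard deviation of the training k-NN distances."""
    d_train = [knn_distance(t, training, k, skip_self=True) for t in training]
    return statistics.mean(d_train) + z * statistics.pstdev(d_train)


def in_domain(query, training, adt, k=3):
    """Rule stated in the abstract: inside the domain when d < ADT."""
    return knn_distance(query, training, k) < adt


def coverage(test, training, adt, k=3):
    """Percentage of test compounds that fall inside the applicability domain."""
    inside = sum(in_domain(q, training, adt, k) for q in test)
    return 100.0 * inside / len(test)


# Toy two-descriptor "training set" (made-up values, not real RDKit descriptors).
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
adt = applicability_threshold(training, k=2)

print("ADT:", adt)
print("near compound in domain:", in_domain((0.5, 0.5), training, adt, k=2))
print("far compound in domain:", in_domain((5.0, 5.0), training, adt, k=2))
```

With real data, `training` would hold the scaled descriptor vectors of the training compounds, and `coverage` would report the kind of percentage quoted in the abstract (65.37% for Model 2G), provided the same distance definition were used.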

Model performance: Dataset 1 (1238 inhibitors)

Model name  Algorithm  Descriptors           Training set R2  Training set RMSE  Test set R2  Test set RMSE
Model 1A    MLR        21 RDKit descriptors  0.55             0.75               0.56         0.77
Model 1B    MLR        22 RDKit descriptors  0.57             0.73               0.55         0.76
Model 1C    SVM        21 RDKit descriptors  0.89             0.38               0.73         0.60
Model 1D    SVM        21 RDKit descriptors  0.90             0.35               0.76         0.56
Model 1E    RF         22 RDKit descriptors  0.86             0.42               0.75         0.58
Model 1F    RF         24 RDKit descriptors  0.86             0.41               0.74         0.59
Model 1G    DNN        63 RDKit descriptors  0.91             0.36               0.76         0.57
Model 2A    MLR        25 RDKit descriptors  0.54             0.77               0.57         0.73
Model 2B    MLR        26 RDKit descriptors  0.56             0.75               0.57         0.72
Model 2C    SVM        24 RDKit descriptors  0.84             0.46               0.76         0.55
Model 2D    SVM        13 RDKit descriptors  0.83             0.47               0.76         0.54
Model 2E    RF         23 RDKit descriptors  0.85             0.44               0.76         0.55
Model 2F    RF         12 RDKit descriptors  0.85             0.44               0.74         0.56
Model 2G    DNN        69 RDKit descriptors  0.88             0.39               0.79         0.51
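
The R2 and RMSE columns can be reproduced from observed and predicted activities with a few lines of code. The sketch below assumes the standard definitions: R2 = 1 - SS_res / SS_tot (the coefficient of determination named in the abstract) and RMSE as the square root of the mean squared residual. The activity values are made up for illustration and are not data from the paper.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical observed and predicted activities for four compounds.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]

print("R2:", round(r2_score(observed, predicted), 2))
print("RMSE:", round(rmse(observed, predicted), 2))
```

Computing both metrics separately on the training and test sets, as in the table, makes overfitting visible: a large gap between training and test R2 (for example 0.89 versus 0.73 for Model 1C) suggests the model fits the training compounds more closely than it generalizes.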

Dataset 2: 299 hydroxylamine derivative inhibitors

Model name  Algorithm  Descriptors           Training set R2  Training set RMSE  Test set R2  Test set RMSE
Model 3A    RF         25 RDKit descriptors  0.89             0.37               0.71         0.53
Model 3B    SVM        16 RDKit descriptors  0.84             0.38               0.64         0.60
Model 3C    RF         22 RDKit descriptors  0.78             0.43               0.61         0.56
Model 3D    SVM        18 RDKit descriptors  0.80             0.41               0.65         0.62
Model 3E    DNN        68 RDKit descriptors  0.90             0.30               0.69         0.59

Dataset 3: 377 seven-membered cyclic urea derivative inhibitors

Model name  Algorithm  Descriptors           Training set R2  Training set RMSE  Test set R2  Test set RMSE
Model 4A    RF         25 RDKit descriptors  0.90             0.28               0.74         0.53
Model 4B    SVM        23 RDKit descriptors  0.87             0.43               0.82         0.51
Model 4C    RF         18 RDKit descriptors  0.91             0.27               0.78         0.56
Model 4D    SVM        24 RDKit descriptors  0.85             0.44               0.73         0.59
Model 4E    DNN        68 RDKit descriptors  0.94             0.27               0.81         0.50

Main project members

田钰嘉 (Tian Yujia)

PhD student

1204429112@qq.com

张声德 (Zhang Shengde)

Master's student